Pilot-Streaming: Design Considerations for a Stream Processing Framework for High-Performance Computing

Authors

  • Andre Luckow
  • Peter M. Kasson
  • Shantenu Jha
Abstract

Streaming capabilities are becoming increasingly important for scientific applications [1], [2], supporting needs such as acting on incoming data and steering. The interoperable use of streaming data sources within HPC environments is a critical capability for an emerging set of applications. Scientific instruments, such as X-ray light sources (e.g., the Advanced Photon Source (APS) and the Advanced Light Source (ALS) [3]), can generate large amounts of high-velocity data in a diverse set of experiments. Coupling the data streams produced by such experiments to HPC capabilities is an important challenge. Processing high-rate data streams, e.g., by running prediction and outlier detection algorithms on them while running larger models in batch mode on the entire dataset, is a challenging task. The increasing demands have led to a heterogeneous landscape of infrastructures and tools that support streaming needs on different levels. Batch frameworks, such as Spark [4], have been extended to provide streaming capabilities [5], while native streaming frameworks, such as Storm [6] and Flink [7], have emerged.

We define a streaming application as an application that processes and acts on real-time data, also referred to as an event stream. Different usage modes for stream processing can be observed:

  • Coordination: Stream processing is used to connect a data source and a data analysis phase. Sometimes this includes pre-processing and transforming the data before it is persisted (e.g., in the Hadoop Distributed File System (HDFS)) and analyzed (e.g., using a Hadoop processing framework).
  • Real-time Analytics: This type of application applies machine learning to incoming data, e.g., for scoring, classification or outlier detection.
  • Analytics and Model Update: These applications combine stream processing with other forms of processing, e.g., the continuous update of a machine learning model on historical data combined with real-time scoring/classification, or a simulation.

The complexity of the application increases from the top to the bottom of this list. In the last application type, streaming, batch and interactive processes that use different abstractions and runtime systems need to be combined, and algorithms need to be adapted to process windows of data instead of the complete bounded dataset.

Several challenges need to be addressed when developing streaming applications:

  • Infrastructure: How to efficiently handle data streams (delivery guarantees, low latencies, varying data rates)? How to decouple data producer and consumer? How to store data to allow flexible stream and batch processing (event stream as a log, random access and mutable storage)?
  • Abstractions: How to decouple application concerns from streaming infrastructures? How can high-level abstractions, such as SQL or data pipelines, be efficiently supported in streaming mode?
  • Applications: The application itself needs to be capable of utilizing streaming data. Often, the algorithm needs to be adapted to meaningfully incorporate incoming data. While batch algorithms assume that they operate on a complete dataset, streaming applications need to operate on a window of data (e.g., a fixed, sliding or session window) [8]. Thus, it is often necessary to balance historical and recent data in machine learning algorithms, e.g., by using decay factors. Simulations, for example, need an algorithm for incorporating streaming data into their current state.
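The window and decay-factor ideas mentioned in the abstract can be illustrated with a small, framework-independent sketch. It is not taken from the paper or from Pilot-Streaming. The function names (`fixed_windows`, `sliding_mean`, `decayed_mean`), the `(timestamp, value)` event layout, and the assumption that events arrive in timestamp order are all hypothetical choices made for illustration.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple

# Hypothetical event layout: (timestamp in seconds, measured value).
Event = Tuple[float, float]


def fixed_windows(events: Iterable[Event], size: float) -> Iterator[List[float]]:
    """Group time-ordered events into non-overlapping windows of `size` seconds."""
    current_index = None
    bucket: List[float] = []
    for ts, value in events:
        index = int(ts // size)  # which fixed window this event falls into
        if current_index is not None and index != current_index:
            yield bucket  # the previous window is complete
            bucket = []
        current_index = index
        bucket.append(value)
    if bucket:
        yield bucket  # emit the last, possibly partial, window


def sliding_mean(events: Iterable[Event], size: float) -> Iterator[float]:
    """After each event, yield the mean over the interval (ts - size, ts]."""
    window: Deque[Event] = deque()
    total = 0.0
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Evict events that have fallen out of the sliding window.
        while window[0][0] <= ts - size:
            _, old_value = window.popleft()
            total -= old_value
        yield total / len(window)


def decayed_mean(values: Iterable[float], decay: float) -> Iterator[float]:
    """Running mean in which each older observation is down-weighted by `decay`."""
    weighted_sum = 0.0
    weight = 0.0
    for value in values:
        weighted_sum = decay * weighted_sum + value
        weight = decay * weight + 1.0
        yield weighted_sum / weight
```

With a decay factor below 1, recent observations dominate the estimate. A decay factor of exactly 1 reduces `decayed_mean` to the ordinary running mean over all data seen so far, which corresponds to the batch-style assumption of a complete dataset.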


Similar resources

Pilot-Streaming: A Stream Processing Framework for High-Performance Computing

An increasing number of scientific applications rely on stream processing for generating timely insights from data feeds of scientific instruments, simulations, and Internet-of-Things (IoT) sensors. The development of streaming applications is a complex task and requires the integration of heterogeneous, distributed infrastructure, frameworks, middleware and application components. Different app...


A Data-Synchronous Event Model for GNU Radio

In this paper, we present a synchronous event stream model overlaid on the existing GNU Radio streaming dataflow model. GNU Radio has long utilized a traditional static streaming dataflow model to interconnect modular signal processing blocks. While this model fits many radio and signal processing applications well, GNU Radio and the applied signal processing community face several pressing nee...


Design and Test of the Real-time Text mining dashboard for Twitter

One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in datasets that are currently being produced at high speed, in large volumes and in a wide variety of formats. Data with such features is called big data. Extracting, processing, and visualizing this huge amount of data has today become one of the concerns of data science scholar...


Toward High-Performance Distributed Stream Processing via Approximate Fault Tolerance

Fault tolerance is critical for distributed stream processing systems, yet achieving error-free fault tolerance often incurs substantial performance overhead. We present AF-Stream, a distributed stream processing system that addresses the trade-off between performance and accuracy in fault tolerance. AF-Stream builds on a notion called approximate fault tolerance, whose idea is to mitigate back...


Benchmarking Distributed Stream Data Processing Systems

The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to compare the systems for simple workloads, there is a clear gap of detailed analyses of the systems’ performance characteristics. In this paper, we propose a ...




Publication date: 2016